We consider the statistical properties of primary sequences of two-letter HP copolymers
(H for hydrophobic and P for polar) designed to have water-soluble globular conformations
with H monomers shielded from water inside the shell of P monomers. We show, both by
computer simulations and by exact analytical calculation, that for large globules and
flexible polymers such sequences exhibit long-range correlations which can be described
by Levy-flight statistics.
PACS numbers: 61.41.+e, 36.20.Ey, 87.15.Cc

In refs. [ ], a new approach to the design of specific primary sequences for
HP-copolymers consisting of monomeric units of two types (hydrophobic H and polar P)
has been proposed by some of the present authors. Unlike some other methods of sequence
design known in the literature (see review [ ] and references therein), the approach in
question does not aim to mimic folding into one particular conformation. The goal is to
model a simpler and more robust property of proteins, such as their ability to stay
dissolved and shield their hydrophobic monomers from water. The essence of this approach
is illustrated in Fig. 1. We start with an arbitrary computer-generated globular
conformation of a homopolymer chain (formed due to the strong attraction of monomer
units, Fig. 1a) and perform a "coloring" procedure: monomer units in the core of the
globule (having many neighbors) are set to be H-units, while monomer units belonging to
the globular surface (where the number of neighbors is smaller) are assigned to be of
P-type, Fig. 1b. Then the obtained primary sequence is fixed, the uniform attraction of
monomer units is removed, and the newly generated HP-copolymer is ready for further
investigation (Fig. 1c). The macromolecules thus obtained are protein-like in the sense
that they mimic the segregation of a globule into a hydrophobic core and a stabilizing
hydrophilic envelope. The properties of protein-like copolymers were examined in
refs. [ ]; see also [ ] on possible ways of experimental realization.

In this Letter, we address correlations between H- and P-units along the protein-like
sequences. This may shed light on the conditions which must be met by the sequence to
provide for the water solubility of globules, an issue of great potential relevance to
our understanding of early evolution. We show, both by computer simulations and by exact
analytical calculation, that the correlations have a long-range character. More
specifically, for the simple model of a flexible polymer, they belong to the so-called
Levy-flight statistics.

To begin with, statistical properties of protein-like HP-sequences can be assessed
computationally by a method similar to that used by Stanley et al. [ ] in their search
for long-range correlations in DNA sequences. We choose a "window" of length $\ell$,
move it step by step along the generated HP-sequence, and at each step count the number
of H units inside the window. This number, which we write as
$\sum_{i=j+1}^{j+\ell} u_i$, is a random variable depending on the position $j$ of the
window along the sequence; here $u_i$ is the variable associated with every monomer $i$
such that $u_i = 1$ if monomer $i$ is H and $u_i = 0$ if it is P. This random variable
has a certain distribution. Its average is determined by the overall sequence
composition (total numbers of H- and P-monomers), and its dispersion is easy to
calculate:



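$$D_\ell^2 = \left\langle \Big( \sum_{i=k+1}^{k+\ell} u_i \Big)^2 \right\rangle - \left\langle \sum_{i=k+1}^{k+\ell} u_i \right\rangle^2 = \sum_{i,j=k+1}^{k+\ell} \left( \langle u_i u_j \rangle - \langle u_i \rangle \langle u_j \rangle \right) \qquad (1)$$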




For a completely random HP-sequence, the value of $D_\ell$ scales as $\ell^{1/2}$ with
the window width $\ell$. The dependence $D_\ell \sim \ell^{\alpha}$ with $\alpha > 1/2$
would then manifest the existence of long-range correlations.
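The window procedure can be sketched in a few lines of Python. This is a minimal illustration, not the code used for Fig. 2: the function names `window_dispersion` and `scaling_exponent` are hypothetical, the averaging runs over window positions in one long sequence rather than over 2000 independent sequences, and the exponent is estimated crudely from two window widths only.

```python
import math
import random


def window_dispersion(seq, ell):
    """Return D_ell: the standard deviation of the number of H units
    (entries equal to 1) over all windows of `ell` consecutive monomers."""
    # Prefix sums give every window count in O(1).
    prefix = [0]
    for u in seq:
        prefix.append(prefix[-1] + u)
    counts = [prefix[j + ell] - prefix[j] for j in range(len(seq) - ell + 1)]
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return math.sqrt(var)


def scaling_exponent(seq, ell_small, ell_large):
    """Estimate alpha in D_ell ~ ell**alpha from two window widths."""
    d_small = window_dispersion(seq, ell_small)
    d_large = window_dispersion(seq, ell_large)
    return math.log(d_large / d_small) / math.log(ell_large / ell_small)


# Uncorrelated 1:1 sequence: the estimate should come out close to 1/2.
rng = random.Random(0)
random_seq = [rng.randint(0, 1) for _ in range(200_000)]
alpha_random = scaling_exponent(random_seq, 4, 64)
print(f"alpha for an uncorrelated 1:1 sequence: {alpha_random:.2f}")
```

For an uncorrelated sequence the window count is binomial, so $D_\ell = \ell^{1/2}/2$ and the estimate stays near 1/2; a value noticeably above 1/2 on a designed sequence would be the signature of correlations discussed in the text.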



FIG. 1. Sequence design scheme for protein-like copoly- 
mers: (a) homopolymer globule; (b) the same globule after 
the coloring procedure; (c) protein-like copolymer in the coil 
state. 

The result of such a calculation, averaged over 2000 independent protein-like
HP-sequences of $N = 1024$ monomer units with 1 : 1 composition (obtained as in
ref. [ ]), is presented in Fig. 2 (squares). For comparison, the same figure shows the
data for two other types of sequences (each averaged over 2000 independent species).
One of them is a purely random 1 : 1 sequence; it demonstrates
$D_\ell \sim \ell^{1/2}$ scaling. Comparing this curve with the Monte Carlo results, we
see immediately that the protein-like sequence is not random and that some correlations
do exist in it. It is therefore interesting to compare the squares in Fig. 2 with the
dashed curve showing data for the sequence which was called "random-block" in
refs. [ ]: the lengths of H- and P-blocks in such a sequence are determined by Poisson
distributions adjusted to achieve the same 1 : 1 composition and the same "degree of
blockiness" (average block length) as for the protein-like HP-copolymer. This sequence
exhibits a somewhat more rapid variation of $D_\ell$ at small $\ell$, but ultimately the
law $D_\ell \sim \ell^{1/2}$ is obeyed for large values of $\ell$. Nevertheless, this
random-block model is also seen to be unsatisfactory for describing the statistical
behavior of the protein-like sequence throughout the interval of $\ell$ examined,
$2 < \ell < 500$. Although the data do not fit accurately to any power law
$D_\ell \sim \ell^{\alpha}$, the slope of the observed dependence corresponds to
$\alpha$ significantly larger than 1/2, up to about 0.85, thus indicating pronounced
long-range correlations in the protein-like sequence. In what follows we present an
analytical theory which produces a curve in Fig. 2 in complete agreement with the
observations.

First of all, let us turn to the origin of long-range correlations in the primary
sequences of HP-copolymers generated via the procedure illustrated in Fig. 1.
Conceptually, this problem is fairly easy to address: since the sequence in this scheme
is uniquely determined by the parent conformation, the statistics of sequences reflects
nothing but the statistics of parent conformations, which, in turn, is well understood.
Indeed, the coloring procedure (Fig. 1b) operates on a dense globular conformation.
Since we consider very compact parent conformations, the statistics of polymer chain
conformations inside the globule is ideal (Gaussian) according to the well-known Flory
theorem [ ]. Therefore, all the statistical properties of parental conformations,
including the correlations in the primary sequences produced by the coloring procedure,
can be derived via the solution of the diffusion equation for random walks with
appropriate boundary conditions.

To understand the fractal aspect of the sequences, it is convenient to concentrate on
their uninterrupted homo-colored sections and on the points of connection between them.
Our coloring procedure (Fig. 1) introduces a separation sphere of radius $R^* < R$,
where $R$ is the radius of the globule, such that all the units which in the parent
conformation are confined inside this sphere are of H-type and the units belonging to
the shell layer $R^* < r < R$ are of P-type. Therefore, a homo-colored section is
produced in our model by a chain section of the parent conformation placed entirely in
either the internal or the shell region of the globule. The probability to have an
uninterrupted succession of some $k$ H-monomer units in the sequence is equal to the
probability that a Gaussian (due to the Flory theorem) polymer has a loop of $k$ monomer
units entirely confined in the H-region with ends on the separation surface. Similarly,
the probability to have an uninterrupted succession of some $k$ P-monomer units in the
sequence is equal to the probability that the ideal parental conformation has a loop of
$k$ monomer units confined within the shell P-region, again with ends on the separation
surface.
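The mapping from a parent conformation to a sequence of homo-colored blocks can be illustrated with a toy model. The sketch below is an assumption-laden stand-in, not the bond fluctuation Monte Carlo used for the data: the parent conformation is replaced by a Gaussian random walk inside a hard sphere, steps that would leave the sphere are simply rejected (so a monomer may repeat the previous position), and the names `colored_walk` and `block_lengths` are hypothetical.

```python
import math
import random


def colored_walk(n_steps, radius, a=1.0, seed=0):
    """Color a Gaussian random walk confined to a sphere of radius R.

    Returns a list of 1 (H, inside r < R*) and 0 (P, shell R* < r < R),
    with R* = 2**(-1/3) * R so that both regions have equal volume.
    """
    rng = random.Random(seed)
    r_star = radius * 2.0 ** (-1.0 / 3.0)
    sigma = a / math.sqrt(3.0)  # per-component step size, so <step^2> = a^2
    # Start from a point distributed uniformly inside the sphere.
    while True:
        pos = [rng.uniform(-radius, radius) for _ in range(3)]
        if sum(x * x for x in pos) < radius ** 2:
            break
    seq = []
    for _ in range(n_steps):
        trial = [x + rng.gauss(0.0, sigma) for x in pos]
        # Hard wall (toy assumption): a step leaving the globule is rejected.
        if sum(x * x for x in trial) < radius ** 2:
            pos = trial
        r2 = sum(x * x for x in pos)
        seq.append(1 if r2 < r_star ** 2 else 0)
    return seq


def block_lengths(seq):
    """Split a 0/1 sequence into runs; return (H_blocks, P_blocks)."""
    h_blocks, p_blocks = [], []
    run_value, run_len = seq[0], 0
    for u in seq:
        if u == run_value:
            run_len += 1
        else:
            (h_blocks if run_value == 1 else p_blocks).append(run_len)
            run_value, run_len = u, 1
    (h_blocks if run_value == 1 else p_blocks).append(run_len)
    return h_blocks, p_blocks


seq = colored_walk(n_steps=50_000, radius=10.0)
h_blocks, p_blocks = block_lengths(seq)
print("H fraction:", sum(seq) / len(seq))
print("longest H block:", max(h_blocks), "longest P block:", max(p_blocks))
```

Each entry of `h_blocks` is the length of a loop that stays inside the inner sphere, and each entry of `p_blocks` is the length of a loop that stays in the shell, which is the correspondence used in the argument above.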
FIG. 2. Dispersion of the number of H-units in a fragment of sequence of size $\ell$
for the protein-like HP-sequence, the random copolymer and the random-block copolymer.
Results of the analytical theory for the protein-like sequence are shown both for the
continuous approximation (thick solid line) and for the discrete approximation (thin
solid line); see the explanation around eq. (11). The corresponding Monte Carlo results
are presented by squares. There are no adjustable parameters involved in the fit; the
length scale $a$ is uniquely determined by the geometry of the bond fluctuation
model [ ].

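To address the probability distributions $P_H(k)$ and $P_P(k)$, we begin with simple
physical arguments yielding

$$P_{H,P}(k) \sim \begin{cases} k^{-3/2}, & 1 \ll k \ll (d_{H,P}/a)^2 \\ \exp\left(-\mathrm{const}\cdot k\,a^2/d_{H,P}^2\right), & k \gg (d_{H,P}/a)^2 \end{cases} \qquad (2)$$
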
The upper asymptotic form is valid for short polymer loops, when neither the curvature
of the separating boundary nor the overall globule shape plays any role. In this regime,
$P(k)$ is simply the probability for a random walk to start at a planar wall and to
return to it for the first time after $k$ steps. This is the classical probabilistic
"first return" problem, for which the answer $\sim k^{-3/2}$ is well known [12]. This
scaling is valid for loop sizes $a\sqrt{k}$ much larger than the monomer size $a$ but
smaller than the relevant characteristic length scale, $d_H = R^*$ for the H-loops
inside the inner sphere, or $d_P = R - R^*$ for the P-loops in the spherical shell. The
second asymptotic form in equation (2) indicates that for long polymer loops the
function $P(k)$ decays exponentially. It is easier to explain this in terms of polymer
statistics: confining a polymer chain of $k$ monomer units in a cavity costs some
entropy $\Delta S$; at $a\sqrt{k} \gg d$ this entropy is linear in $k$, making the
probability, $\exp(\Delta S)$, exponential in $k$.
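The $k^{-3/2}$ law can be checked against the simplest exactly solvable case. The sketch below uses the textbook first-return probability of a simple symmetric one-dimensional walk, $f_{2n} = \binom{2n}{n} / [(2n-1)\,4^n]$, as a stand-in for the motion normal to a planar wall; the lattice walk and the helper names are illustrative assumptions, not the chain model of the paper.

```python
import math


def first_return_probability(n):
    """Probability that a simple symmetric 1D random walk returns to its
    starting point for the first time at step 2n (exact result)."""
    # f_{2n} = C(2n, n) / ((2n - 1) * 4**n), evaluated through log-gamma
    # so that large n does not overflow.
    log_f = (math.lgamma(2 * n + 1) - 2.0 * math.lgamma(n + 1)
             - n * math.log(4.0) - math.log(2 * n - 1))
    return math.exp(log_f)


def local_exponent(n):
    """Log-log slope of the first-return probability between n and 2n."""
    ratio = first_return_probability(2 * n) / first_return_probability(n)
    return math.log(ratio) / math.log(2.0)


# The local slope approaches -3/2 as the loop length grows.
for n in (10, 100, 1000):
    print(n, round(local_exponent(n), 3))
```

The slope is slightly steeper than $-3/2$ for short loops and converges to it for long ones, which is the regime the upper line of the scaling form refers to.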
Let us now look closer at the cross-over values of $k$. In order to achieve the 1 : 1
composition, $R^*$ must be chosen such that the volumes of the internal H- and shell
P-regions are the same, which means $R^* = 2^{-1/3} R \approx 0.8 R$. The volume
fraction of polymer units in a globule, $\phi$, is controlled by the energy of
interactions of monomer units used to prepare the parental conformation. It is clear
that $R \approx 0.6\, a N^{1/3} \phi^{-1/3}$. Therefore, H-loops remain in the
power-law, long-range correlation regime up to the length
$k < 0.24\, N^{2/3} \phi^{-2/3}$, while for P-loops this cross-over occurs somewhat
earlier: $k < 0.015\, N^{2/3} \phi^{-2/3}$. Thus, we predict that there should be over a
decade of length scales in which H-loops are still long-range correlated, while only
short-range correlations remain in the P-loops.
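The arithmetic behind these cross-over estimates is short enough to spell out. The sketch below assumes that each monomer occupies a volume $a^3$, so that $(4/3)\pi R^3 \phi = N a^3$, and takes the cross-over loop length as $(d/a)^2$ with $d_H = R^*$ and $d_P = R - R^*$; the function name and the value $\phi = 0.5$ are illustrative choices, not taken from the text.

```python
import math


def crossover_lengths(n_monomers, phi, a=1.0):
    """Return (R, R_star, k_H, k_P) for a globule of N monomers.

    Assumes each monomer occupies a volume a**3, so that
    (4/3) * pi * R**3 * phi = N * a**3, and takes the cross-over loop
    length as (d / a)**2 with d_H = R* and d_P = R - R*.
    """
    radius = (3.0 * n_monomers / (4.0 * math.pi * phi)) ** (1.0 / 3.0) * a
    r_star = 2.0 ** (-1.0 / 3.0) * radius  # equal H and P volumes
    k_h = (r_star / a) ** 2
    k_p = ((radius - r_star) / a) ** 2
    return radius, r_star, k_h, k_p


N, phi = 1024, 0.5  # phi = 0.5 is an arbitrary illustrative density
R, R_star, k_H, k_P = crossover_lengths(N, phi)
scale = N ** (2.0 / 3.0) * phi ** (-2.0 / 3.0)
print(f"R = {R:.2f} a, R* = {R_star:.2f} a")
print(f"k_H = {k_H:.1f} = {k_H / scale:.3f} N^(2/3) phi^(-2/3)")
print(f"k_P = {k_P:.1f} = {k_P / scale:.3f} N^(2/3) phi^(-2/3)")
print(f"k_H / k_P = {k_H / k_P:.1f}")
```

With the unrounded prefactor $(3/4\pi)^{1/3} \approx 0.62$ the coefficients come out near 0.24 and 0.016; the value 0.015 quoted in the text follows from the rounded prefactor 0.6. The ratio of the two cross-over lengths, $[2^{-1/3}/(1 - 2^{-1/3})]^2 \approx 15$, does not depend on $N$ or $\phi$.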

The result (2) is sufficient to explain qualitatively the correlations in protein-like
sequences, including the data shown in Fig. 2. Indeed, according to our discussion, a
protein-like sequence can be thought of as an alternating succession of H- and
P-stretches, with the lengths of stretches taken independently from the corresponding
distributions $P_H(k)$ and $P_P(k)$. This mathematical scheme is called a Levy
flight [ ]. We conclude that the long-range correlations in the primary sequences of
protein-like copolymers are described by Levy-flight statistics. Furthermore, for the
$k^{-3/2}$ behavior of $P(k)$, the average block length diverges, and, therefore, the
value of $D_\ell$ in the power-law regime is controlled by the longest block, yielding
$D_\ell \sim \ell$, or $\alpha = 1$. This is true as long as both H- and P-loops remain
in the fractal regime. On the other hand, when all loops cross over to the exponential
distribution, $D_\ell$ crosses over to $\alpha = 1/2$:



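$$D_\ell \sim \begin{cases} \ell, & \text{for } 1 < \ell < 0.015\, N^{2/3} \phi^{-2/3} \\ \ell^{1/2}, & \text{for } \ell > 0.24\, N^{2/3} \phi^{-2/3} \end{cases} \qquad (3)$$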



The cross-over region for $D_\ell$ is very broad; it corresponds to the situation in
which P-loops are already "large," while H-loops are still "small." Both the
$\alpha = 1$ and $\alpha = 1/2$ limits and the wide cross-over agree qualitatively well
with the computational data, Fig. 2. This motivates a more careful theory, in which,
instead of the scaling estimates (2) and (3), expressions in terms of infinite series,
suitable for numerical calculation, are obtained.
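The $\alpha = 1$ limit of the scaling picture can be illustrated with synthetic sequences. The sketch below is an idealization under stated assumptions: block lengths are drawn from a pure power law with $P(\text{length} \ge k) = k^{-1/2}$ and no exponential cut-off, both colors use the same distribution, and the names `levy_block_sequence`, `pooled_dispersion` and `exponent` are hypothetical. It is not the procedure used to produce Fig. 2.

```python
import math
import random


def levy_block_sequence(n_monomers, rng):
    """Alternate H (1) and P (0) blocks whose lengths k >= 1 are drawn
    independently with P(length >= k) = k**(-1/2), i.e. P(k) ~ k**(-3/2)."""
    seq = []
    color = rng.randint(0, 1)
    while len(seq) < n_monomers:
        u = 1.0 - rng.random()  # uniform in (0, 1]
        length = min(int(u ** -2.0), n_monomers)  # inverse-transform sampling
        seq.extend([color] * length)
        color = 1 - color
    return seq[:n_monomers]


def pooled_dispersion(sequences, ell):
    """D_ell from window counts pooled over all positions and sequences."""
    total = total_sq = n_windows = 0
    for seq in sequences:
        prefix = [0]
        for u in seq:
            prefix.append(prefix[-1] + u)
        for j in range(len(seq) - ell + 1):
            count = prefix[j + ell] - prefix[j]
            total += count
            total_sq += count * count
            n_windows += 1
    mean = total / n_windows
    return math.sqrt(total_sq / n_windows - mean * mean)


def exponent(sequences, ell_small=4, ell_large=32):
    """Two-point estimate of alpha in D_ell ~ ell**alpha."""
    d_small = pooled_dispersion(sequences, ell_small)
    d_large = pooled_dispersion(sequences, ell_large)
    return math.log(d_large / d_small) / math.log(ell_large / ell_small)


rng = random.Random(1)
levy = [levy_block_sequence(2048, rng) for _ in range(100)]
uncorrelated = [[rng.randint(0, 1) for _ in range(2048)] for _ in range(100)]
alpha_levy = exponent(levy)
alpha_random = exponent(uncorrelated)
print(f"alpha (Levy blocks) = {alpha_levy:.2f}")
print(f"alpha (uncorrelated) = {alpha_random:.2f}")
```

Because most short windows fall entirely inside one long block, the window count is close to either 0 or $\ell$, which gives $D_\ell \approx \ell/2$ and an exponent near 1, while the uncorrelated sequences stay near 1/2.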

To develop a full analytical theory, it is convenient to use random walk terminology to
describe the parent conformation. In this language, for instance, $P_H(k)$ is the
probability that the random walker enters a sphere of radius $R^*$ and then arrives back
at the boundary for the first time after "time" $k$. Recall that the statistical weight
of all random walk trajectories starting at the point $\mathbf{r}_0$ and arriving after
$k$ steps at the point $\mathbf{r}$, $G(\mathbf{r},k|\mathbf{r}_0)$, obeys the diffusion
equation



$$\frac{\partial G(\mathbf{r},k|\mathbf{r}_0)}{\partial k} = \frac{a^2}{6}\, \Delta G(\mathbf{r},k|\mathbf{r}_0) + \delta(k)\, \delta(\mathbf{r}-\mathbf{r}_0) \qquad (4)$$



where $a^2$ is the mean square length of one step (the squared size of one monomer unit
along the chain). To introduce the condition of first return, we have to require that
the walker never touches the boundary, which is achieved by imposing the boundary
condition



$$G(\mathbf{r},k|\mathbf{r}_0)\Big|_{|\mathbf{r}|=R^*} = 0 \qquad (5)$$



The probability distribution of the "first return times" 
in terms of G is then given as the time-dependent flux of 
diffusing particles through the absorbing wall: 



$$P_H(k) = \left| \oint \frac{a^2}{6}\, \frac{\partial G}{\partial r}\bigg|_{r=R^*} dS \right| \qquad (6)$$



where $\partial/\partial r$ means the component of the gradient normal to the surface,
the integration is performed over the closed separating surface, and the absolute value
is written to avoid thinking about the direction of the flux. The normalization
condition $\int P_H(k)\, dk = 1$ is guaranteed by the fact that all diffusing particles
eventually leave through the surface. As regards $\mathbf{r}_0$, it should be taken
within a distance of order $a$ from the separating surface of radius $R^*$. The problem
thus formulated, including equations (4)-(6), is easy to solve: we write $G$ in terms of
the bilinear expansion
$G = \sum_n e^{-\lambda_n k} \psi_n(\mathbf{r}) \psi_n(\mathbf{r}_0)$ over the
eigenfunctions $\psi_n$ satisfying $(a^2/6) \Delta \psi_n = -\lambda_n \psi_n$ with the
boundary condition (5). Upon spherical integration in (6), all angular-dependent
harmonics vanish, and we arrive at



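$$P_H(k) = \frac{\pi a^2}{3 R^* r_0} \sum_{n=1}^{\infty} (-1)^{n+1}\, n \sin\left( \frac{\pi n r_0}{R^*} \right) \exp\left( -\frac{\pi^2 a^2 n^2 k}{6 R^{*2}} \right) \qquad (7)$$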
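A series of this type is straightforward to evaluate numerically. The sketch below assumes the form obtained by solving eqs. (4)-(6) with radial eigenfunctions $\sin(\pi n r/R^*)/r$, namely $P_H(k) = \frac{\pi a^2}{3R^* r_0} \sum_n (-1)^{n+1} n \sin(\pi n r_0/R^*) \exp(-\pi^2 a^2 n^2 k / 6R^{*2})$; the function name `p_h` and the values $R^* = 50a$, $r_0 = R^* - a$ are illustrative assumptions.

```python
import math


def p_h(k, r_star, r0, a=1.0, n_terms=2000):
    """First-return distribution for H-loops from the assumed series
    P_H(k) = pi a^2 / (3 R* r0)
             * sum_n (-1)^(n+1) n sin(pi n r0 / R*)
                     * exp(-pi^2 a^2 n^2 k / (6 R*^2))."""
    rate = math.pi ** 2 * a ** 2 / (6.0 * r_star ** 2)
    total = 0.0
    for n in range(1, n_terms + 1):
        decay = math.exp(-rate * n * n * k)
        if decay == 0.0:  # remaining terms underflow to zero
            break
        sign = 1.0 if n % 2 == 1 else -1.0
        total += sign * n * math.sin(math.pi * n * r0 / r_star) * decay
    return math.pi * a ** 2 / (3.0 * r_star * r0) * total


R_STAR, R0 = 50.0, 49.0  # start one monomer size inside the boundary

# Intermediate loops: local log-log slope close to -3/2.
slope = math.log(p_h(100, R_STAR, R0) / p_h(50, R_STAR, R0)) / math.log(2.0)
print(f"local slope between k = 50 and k = 100: {slope:.3f}")

# Long loops: pure exponential decay governed by the n = 1 term.
rate_1 = math.pi ** 2 / (6.0 * R_STAR ** 2)
log_ratio = math.log(p_h(20000, R_STAR, R0) / p_h(10000, R_STAR, R0))
print(f"log ratio over 10^4 steps: {log_ratio:.3f} (n = 1 term: {-rate_1 * 1e4:.3f})")
```

For $a\sqrt{k} \ll R^*$ many terms contribute and the sum reproduces the power law of the scaling argument; for $a\sqrt{k} \gg R^*$ only the lowest mode survives and the decay is exponential, in line with the two asymptotic regimes discussed earlier.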



The distribution $P_P(k)$ can be derived similarly, except that now we have to take care
of the boundary condition at the outer surface of the globule. To this end, we argue
that this condition must be taken in the form



$$\nabla_{r} G(\mathbf{r},k|\mathbf{r}_0)\Big|_{|\mathbf{r}|=R} = 0 \qquad (8)$$



Indeed, formally this condition ensures a constant density of monomer units throughout
the globule for large values of $k$, as well as the breaking of correlations as soon as
the polymer chain is "reflected" by the globular boundary. Physically, this boundary
condition reflects the fact that there is always a "sticky layer" (or depletion layer)
formed self-consistently along the internal surface of the globule due to the effective
attraction of monomer units to the outer region, where the polymer density is depleted
and the excluded volume effect is reduced. As long as we are not interested in the
structure of the surface layer of the globule, we can just replace this layer by the
effective boundary condition (8). After calculations for $P_P(k)$ we obtain

$$P_P(k) = \frac{a^2 R^*}{3 r_0 (R-R^*)^2} \sum_{n=1}^{\infty} \frac{\zeta_n \sin\left[ \zeta_n (r_0 - R^*)/(R - R^*) \right]}{1 - \sin(2\zeta_n)/(2\zeta_n)} \times \exp\left( -\frac{a^2 \zeta_n^2 k}{6 (R-R^*)^2} \right) \qquad (9)$$

where $\zeta_n$ satisfies $\zeta_n = (1 - R^*/R) \tan\zeta_n$.

Finally, to compute the dispersion $D_\ell$, we note that $\langle u_i u_j \rangle$ in
eq. (1) is the probability that both units $i$ and $j$ are of H type, which happens if
both are located inside the $R^*$ region in the parental conformation. Thus
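
$$\langle u_i u_j \rangle = \frac{3}{4\pi R^3} \int_{|\mathbf{r}|<R^*} d^3r \int_{|\mathbf{r}_0|<R^*} d^3r_0\; G(\mathbf{r},|i-j|\,|\,\mathbf{r}_0) = \frac{3}{4\pi R^3} \sum_{n=0}^{\infty} e^{-\lambda_n |i-j|} \left( \int_{|\mathbf{r}|<R^*} \psi_n(\mathbf{r})\, d^3r \right)^2 \qquad (10)$$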



where $G(\mathbf{r},k|\mathbf{r}_0)$ is the Green function satisfying eq. (4) with the
boundary condition (8), and $\psi_n$ and $\lambda_n$ are the corresponding
eigenfunctions and eigenvalues. Plugging this into eq. (1), one arrives at a
cumbersome-looking expression for $D_\ell$ which is easy to implement numerically; the
result is plotted in Fig. 2 and shows a virtually perfect fit to the Monte Carlo
data [ ]. In fact, this fit may even be somewhat fortuitous; indeed, along with the
diffusion equation (4), which is an approximation for the underlying bond fluctuation
model, we can also switch from summation to integration over monomer positions, yielding



$$D_\ell^2 = 6 \ell^2 \sum_{n=1}^{\infty} \left[ \frac{\sin(\xi_n R^*/R) - (\xi_n R^*/R) \cos(\xi_n R^*/R)}{\xi_n^3 \cos\xi_n} \right]^2 g\left( \frac{a^2 \xi_n^2 \ell}{6 R^2} \right) \qquad (11)$$

where $\xi_n$ satisfies the equation $\xi_n = \tan\xi_n$, and the function $g$ is
defined as $g(x) = 2(x - 1 + \exp(-x))/x^2$. (Note that the sum in eq. (11) starts from
$n = 1$ and does not include the ground state, for which $\psi_0$ is a constant.) It is
easy to check that equation (11) does indeed have asymptotic behavior in accord with
(3), including a broad cross-over; similarly, eqs. (7,9) agree with (2). Numerically,
eq. (11) fits the data pretty well (Fig. 2), but, we repeat, the best fit is achieved by
the cumbersome discrete formula.
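The continuous approximation lends itself to a short numerical check of both limits. The sketch below assumes the form $D_\ell^2 = 6\ell^2 \sum_{n \ge 1} A_n^2\, g(a^2 \xi_n^2 \ell / 6R^2)$ with $A_n = [\sin(\rho\xi_n) - \rho\xi_n \cos(\rho\xi_n)] / (\xi_n^3 \cos\xi_n)$, $\rho = R^*/R = 2^{-1/3}$ and $\xi_n = \tan\xi_n$, which follows from radial Neumann eigenfunctions of a sphere of radius $R$; the function names and the test radii are illustrative assumptions.

```python
import math


def neumann_roots(n_roots):
    """Positive roots of xi = tan(xi), one in each (n*pi, n*pi + pi/2)."""
    roots = []
    for n in range(1, n_roots + 1):
        xi = n * math.pi + 0.5 * math.pi
        for _ in range(30):
            xi = n * math.pi + math.atan(xi)  # contracting fixed-point map
        roots.append(xi)
    return roots


def debye_g(x):
    """g(x) = 2 (x - 1 + exp(-x)) / x^2, with the small-x limit g -> 1."""
    if x < 1e-4:
        return 1.0 - x / 3.0
    return 2.0 * (x - 1.0 + math.exp(-x)) / (x * x)


def dispersion_continuous(ell, radius, roots, a=1.0):
    """D_ell in the continuous approximation (assumed series form)."""
    rho = 2.0 ** (-1.0 / 3.0)  # R* / R for 1:1 composition
    total = 0.0
    for xi in roots:
        amp = (math.sin(rho * xi) - rho * xi * math.cos(rho * xi)) \
            / (xi ** 3 * math.cos(xi))
        total += amp * amp * debye_g(a * a * xi * xi * ell / (6.0 * radius ** 2))
    return ell * math.sqrt(6.0 * total)


def local_slope(ell, radius, roots):
    """Log-log slope of D_ell between ell and 2 * ell."""
    d1 = dispersion_continuous(ell, radius, roots)
    d2 = dispersion_continuous(2 * ell, radius, roots)
    return math.log(d2 / d1) / math.log(2.0)


roots = neumann_roots(20000)
slope_small = local_slope(1, 1000.0, roots)   # very large globule, short window
slope_large = local_slope(2000, 5.0, roots)   # small globule, long window
print(f"slope for ell << (R/a)^2: {slope_small:.3f}")
print(f"slope for ell >> (R/a)^2: {slope_large:.3f}")
```

In the short-window limit $g \to 1$ and the sum of the coefficients equals the single-site variance 1/4, giving $D_\ell \approx \ell/2$ and $\alpha = 1$; in the long-window limit $g(x) \approx 2/x$ and $D_\ell \sim \ell^{1/2}$. Many terms are kept because the coefficients decay only as $1/n^2$.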

In conclusion, we have shown that protein-like copolymers generated according to the
coloring procedure proposed earlier [ ] exhibit long-range correlations in their primary
sequences. For flexible polymers and large enough globules, these correlations belong to
Levy-flight statistics. This result, first observed in computer experiments, is
confirmed and explained by analytical calculation. The analytical theory suggests that
Levy-flight statistics, albeit with a broader cross-over region, is expected even if the
parental conformation is not maximally compact, but rather a globule somewhat closer to
the $\theta$-point. It becomes clear from our model that the segregation of a globule
into a hydrophobic core and a hydrophilic peel, which is the necessary condition for
water solubility, does impose severe restrictions on the sequence and, therefore, must
be manifested in certain correlations. Qualitatively, this agrees with an earlier study
of protein sequences [14]. However, the precise form of the correlations may be affected
by both globule size and chain flexibility, including the aspect of secondary structure.



Moreover, another effect exists which favors correlations of the opposite sign, and
which dominates for small globules and/or rigid polymers [ ]. Identification of
long-range correlations in protein sequences becomes, therefore, an interesting task
promising to shed light on the evolutionary criteria involved in the selection of
proteins and the role of water solubility among them.